This data set contains 113,937 loans with 81 variables on each loan, including loan amount, interest rate, current loan status, borrower income, and many others. The analysis in this part would be structured to provide simple univariate relationships to multivariate relationships, this research would provide answers to questions like whether the monthly loan payment has a correlation or any relationship between loan original,amount, what is the spread of lterm of loan in loan status, identifying the frequency of the categorical variables; Term of loan, borrower's employment status, year of loan, and loan status, are there differences between loans depending on how the loan term large the original loan amount was. We have 81 features attributed to each record of loans in the dataset and the descriptions of the features are detailed below:
- ListingKey Unique key for each listing, same value as the 'key' used in the listing object in the API.
- ListingNumber The number that uniquely identifies the listing to the public as displayed on the website. -ListingCreationDate The date the listing was created. -CreditGrade The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings. -Term The length of the loan expressed in months. -LoanStatus The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, ---- -PastDue. The PastDue status will be accompanied by a delinquency bucket. -ClosedDate Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses. -BorrowerAPR The Borrower's Annual Percentage Rate (APR) for the loan. -BorrowerRate The Borrower's interest rate for this loan. -LenderYield The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee. -EstimatedEffectiveYield Effective yield is equal to the borrower interest rate (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. Applicable for loans originated after July 2009. -EstimatedLoss Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009. -EstimatedReturn The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009. -ProsperRating (numeric) The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009. -ProsperRating (Alpha) The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009. -ProsperScore A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009. -ListingCategory The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - --Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans -BorrowerState The two letter abbreviation of the state of the address of the borrower at the time the Listing was created. -Occupation The Occupation selected by the Borrower at the time they created the listing. -EmploymentStatus The employment status of the borrower at the time they posted the listing. -EmploymentStatusDuration The length in months of the employment status at the time the listing was created. -IsBorrowerHomeowner A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner. -CurrentlyInGroup Specifies whether or not the Borrower was in a group at the time the listing was created. -GroupKey The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation. -DateCreditPulled The date the credit profile was pulled. -CreditScoreRangeLower The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency. -CreditScoreRangeUpper The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency. -FirstRecordedCreditLine The date the first credit line was opened. -CurrentCreditLines Number of current credit lines at the time the credit profile was pulled. -OpenCreditLines Number of open credit lines at the time the credit profile was pulled. -TotalCreditLinespast7years Number of credit lines in the past seven years at the time the credit profile was pulled. -OpenRevolvingAccounts Number of open revolving accounts at the time the credit profile was pulled. -OpenRevolvingMonthlyPayment Monthly payment on revolving accounts at the time the credit profile was pulled. -InquiriesLast6Months Number of inquiries in the past six months at the time the credit profile was pulled. -TotalInquiries Total number of inquiries at the time the credit profile was pulled. -CurrentDelinquencies Number of accounts delinquent at the time the credit profile was pulled. -AmountDelinquent Dollars delinquent at the time the credit profile was pulled. -DelinquenciesLast7Years Number of delinquencies in the past 7 years at the time the credit profile was pulled. -PublicRecordsLast10Years Number of public records in the past 10 years at the time the credit profile was pulled. -PublicRecordsLast12Months Number of public records in the past 12 months at the time the credit profile was pulled. -RevolvingCreditBalance Dollars of revolving credit at the time the credit profile was pulled. -BankcardUtilization The percentage of available revolving credit that is utilized at the time the credit profile was pulled. -AvailableBankcardCredit The total available credit via bank card at the time the credit profile was pulled. -TotalTrades Number of trade lines ever opened at the time the credit profile was pulled. -TradesNeverDelinquent Number of trades that have never been delinquent at the time the credit profile was pulled. -TradesOpenedLast6Months Number of trades opened in the last 6 months at the time the credit profile was pulled. -DebtToIncomeRatio The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%). -IncomeRange The income range of the borrower at the time the listing was created. -IncomeVerifiable The borrower indicated they have the required documentation to support their income. -StatedMonthlyIncome The monthly income the borrower stated at the time the listing was created. -LoanKey Unique key for each loan. This is the same key that is used in the API. -TotalProsperLoans Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans. -TotalProsperPaymentsBilled Number of on time payments the borrower made on Prosper loans at the time they created this listing. This value will be null if the borrower had no prior loans. -OnTimeProsperPayments Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans. -ProsperPaymentsLessThanOneMonthLate Number of payments the borrower made on Prosper loans that were less than one month late at the time they created this listing. This value will be null if the borrower had no prior loans. -ProsperPaymentsOneMonthPlusLate Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans. -ProsperPrincipalBorrowed Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans. -ProsperPrincipalOutstanding Principal outstanding on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans. -ScorexChangeAtTimeOfListing Borrower's credit score change at the time the credit profile was pulled. This will be the change relative to the borrower's last Prosper loan. This value will be null if the borrower had no prior loans. -LoanCurrentDaysDelinquent The number of days delinquent. -LoanFirstDefaultedCycleNumber The cycle the loan was charged off. If the loan has not charged off the value will be null. -LoanMonthsSinceOrigination Number of months since the loan originated. -LoanNumber Unique numeric value associated with the loan. -LoanOriginalAmount The origination amount of the loan. -LoanOriginationDate The date the loan was originated. -LoanOriginationQuarter The quarter in which the loan was originated. -MemberKey The unique key that is associated with the borrower. This is the same identifier that is used in the API member object. -MonthlyLoanPayment The scheduled monthly loan payment. -LP_CustomerPayments Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries. -LP_CustomerPrincipalPayments Pre charge-off cumulative principal payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries. -LP_InterestandFees Pre charge-off cumulative interest and fees paid by the borrower. If the loan has charged off, this value will exclude any recoveries. -LP_ServiceFees Cumulative service fees paid by the investors who have invested in the loan. -LP_CollectionFees Cumulative collection fees paid by the investors who have invested in the loan. -LP_GrossPrincipalLoss The gross charged off amount of the loan. -LP_NetPrincipalLoss The principal that remains uncollected after any recoveries. -LP_NonPrincipalRecoverypayments The interest and fee component of any recovery payments. The current payment policy applies payments in the following order: Fees, interest, principal. -PercentFunded Percent the listing was funded. -Recommendations Number of recommendations the borrower had at the time the listing was created. -InvestmentFromFriendsCount Number of friends that made an investment in the loan. -InvestmentFromFriendsAmount Dollar amount of investments that were made by friends. -Investors The number of investors that funded the loan.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
sns.set_palette("Set2", 8, .75)
%matplotlib inline
# load dataset
loan_df = pd.read_csv('prosperLoanData.csv')
# call the df
loan_df.head()
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
5 rows × 81 columns
# columns in the df
loan_df.columns
Index(['ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade',
'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate',
'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss',
'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)',
'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState',
'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration',
'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey',
'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper',
'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines',
'TotalCreditLinespast7years', 'OpenRevolvingAccounts',
'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries',
'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years',
'PublicRecordsLast10Years', 'PublicRecordsLast12Months',
'RevolvingCreditBalance', 'BankcardUtilization',
'AvailableBankcardCredit', 'TotalTrades',
'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months',
'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable',
'StatedMonthlyIncome', 'LoanKey', 'TotalProsperLoans',
'TotalProsperPaymentsBilled', 'OnTimeProsperPayments',
'ProsperPaymentsLessThanOneMonthLate',
'ProsperPaymentsOneMonthPlusLate', 'ProsperPrincipalBorrowed',
'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing',
'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber',
'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginalAmount',
'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey',
'MonthlyLoanPayment', 'LP_CustomerPayments',
'LP_CustomerPrincipalPayments', 'LP_InterestandFees', 'LP_ServiceFees',
'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss',
'LP_NonPrincipalRecoverypayments', 'PercentFunded', 'Recommendations',
'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount',
'Investors'],
dtype='object')
# structure of df
shape_loan = loan_df.shape
print('The struture of the loan dataset is {}'.format(shape_loan))
The struture of the loan dataset is (113937, 81)
# info of the df
loan_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
# check for duplicated
loan_df[loan_df.duplicated()]
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|
0 rows × 81 columns
# check null values in colmun LoanStatus
loan_df[loan_df.LoanStatus.isnull()]
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|
0 rows × 81 columns
# loan term in months value count
loan_df.Term.value_counts()
36 87778 60 24545 12 1614 Name: Term, dtype: int64
# check data type
loan_df.Term.dtype
dtype('int64')
The term of loan expressed in months reveals that the column data type should be categorical, this would be converted from integer to categorical, also the numerical value would be renamed to a categorical term.
# replace the numerical value to categorical term
loan_df.Term = loan_df.Term.map({36: 'Medium Term', 60: 'Long Term', 12: 'Short Term'})
# check term value counts
loan_df.Term.value_counts()
Medium Term 87778 Long Term 24545 Short Term 1614 Name: Term, dtype: int64
# convert to categorical data type
loan_df.Term = loan_df.Term.astype('category')
# check data type
loan_df.Term.dtype
CategoricalDtype(categories=['Long Term', 'Medium Term', 'Short Term'], ordered=False)
# value count of Loan Status column
loan_df.LoanStatus.value_counts()
Current 56576 Completed 38074 Chargedoff 11992 Defaulted 5018 Past Due (1-15 days) 806 Past Due (31-60 days) 363 Past Due (61-90 days) 313 Past Due (91-120 days) 304 Past Due (16-30 days) 265 FinalPaymentInProgress 205 Past Due (>120 days) 16 Cancelled 5 Name: LoanStatus, dtype: int64
Loan status values reveals that the column data type should be categorical, this would be effected in the following line of code.
# change data type of Loan Status column
loan_df.LoanStatus = loan_df.LoanStatus.astype('category')
# check the dtype
loan_df.LoanStatus.dtype
CategoricalDtype(categories=['Cancelled', 'Chargedoff', 'Completed', 'Current',
'Defaulted', 'FinalPaymentInProgress',
'Past Due (1-15 days)', 'Past Due (16-30 days)',
'Past Due (31-60 days)', 'Past Due (61-90 days)',
'Past Due (91-120 days)', 'Past Due (>120 days)'],
, ordered=False)
The past due values in days would have a generic value Past Due, this would be replaced with the following line of code
# replace to past due all past
# due day regardless of the number of days
loan_df.LoanStatus = loan_df.LoanStatus.replace({'Past Due (1-15 days)':'Past Due', 'Past Due (16-30 days)':'Past Due',
'Past Due (31-60 days)':'Past Due', 'Past Due (61-90 days)' :'Past Due',
'Past Due (91-120 days)' :'Past Due', 'Past Due (>120 days)' :'Past Due'})
# check the values
loan_df.LoanStatus
0 Completed
1 Current
2 Completed
3 Current
4 Current
...
113932 Current
113933 FinalPaymentInProgress
113934 Current
113935 Completed
113936 Current
Name: LoanStatus, Length: 113937, dtype: category
Categories (7, object): ['Cancelled', 'Chargedoff', 'Completed', 'Current', 'Defaulted', 'FinalPaymentInProgress', 'Past Due']
# check values of home ower column
loan_df.IsBorrowerHomeowner.value_counts()
True 57478 False 56459 Name: IsBorrowerHomeowner, dtype: int64
# date loan was originated
loan_df.LoanOriginationDate.value_counts()
2014-01-22 00:00:00 491
2013-11-13 00:00:00 490
2014-02-19 00:00:00 439
2013-10-16 00:00:00 434
2014-01-28 00:00:00 339
...
2005-12-30 00:00:00 1
2005-12-12 00:00:00 1
2006-02-10 00:00:00 1
2008-10-17 00:00:00 1
2006-01-31 00:00:00 1
Name: LoanOriginationDate, Length: 1873, dtype: int64
Convert the loan origination date column from object to datetime and extracting the year out of the column for analysis
# convert to datetime
loan_df.LoanOriginationDate = pd.to_datetime(loan_df.LoanOriginationDate)
# check the datatype
loan_df.LoanOriginationDate.dtype
dtype('<M8[ns]')
# extract year from the column
loan_df['Year'] = pd.DatetimeIndex(loan_df['LoanOriginationDate']).year
# check the extracted column
loan_df.Year
0 2007
1 2014
2 2007
3 2012
4 2013
...
113932 2013
113933 2011
113934 2013
113935 2011
113936 2014
Name: Year, Length: 113937, dtype: int64
# check data type
loan_df.Year.dtype
dtype('int64')
# convert year to categorical variable
loan_df.Year = loan_df.Year.astype('category')
# check data type
loan_df.Year.dtype
CategoricalDtype(categories=[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014], ordered=False)
# value count of verfiable income
loan_df.IncomeVerifiable.value_counts()
True 105268 False 8669 Name: IncomeVerifiable, dtype: int64
# number of recommedations values
loan_df.Recommendations.value_counts()
0 109678 1 3516 2 568 3 108 4 26 5 14 9 6 7 5 6 4 8 3 16 2 18 2 14 1 19 1 21 1 24 1 39 1 Name: Recommendations, dtype: int64
# emolyment status values
loan_df.EmploymentStatus.value_counts()
Employed 67322 Full-time 26355 Self-employed 6134 Not available 5347 Other 3806 Part-time 1088 Not employed 835 Retired 795 Name: EmploymentStatus, dtype: int64
# sum of missing values
loan_df.EmploymentStatus.isnull().sum()
2255
# EmploymentStatus missing values records
loan_df[loan_df.EmploymentStatus.isnull()]
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | 109D3366077649289619396 | 24135 | 2006-07-11 17:14:14.647000000 | B | Medium Term | Completed | 2009-10-22 00:00:00 | 0.15211 | 0.1450 | 0.1400 | ... | -299.78 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 44 | 2006 |
| 92 | 0FE433661215082408720F0 | 22591 | 2006-07-02 11:05:02.857000000 | AA | Medium Term | Completed | 2007-03-01 00:00:00 | 0.08292 | 0.0755 | 0.0705 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 21 | 2006 |
| 146 | 34B03365511229087456DB5 | 31745 | 2006-08-16 09:43:10.867000000 | A | Medium Term | Completed | 2006-12-04 00:00:00 | 0.09939 | 0.0925 | 0.0850 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 67 | 2006 |
| 162 | 10463364871408702EBC041 | 17578 | 2006-06-03 07:46:36.343000000 | AA | Medium Term | Completed | 2007-01-09 00:00:00 | 0.08483 | 0.0780 | 0.0730 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 93 | 2006 |
| 193 | 0F6133644711535583E92C3 | 17091 | 2006-05-31 23:48:17.177000000 | E | Medium Term | Defaulted | 2007-10-08 00:00:00 | 0.23937 | 0.2300 | 0.2250 | ... | 0.00 | 1590.40 | 1584.56 | 0.0 | 1.0 | 0 | 0 | 0.0 | 13 | 2006 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113663 | DBD7336509210597772988B | 21182 | 2006-06-22 12:04:17.483000000 | HR | Medium Term | Chargedoff | 2008-12-06 00:00:00 | 0.29525 | 0.2875 | 0.2450 | ... | 0.00 | 1081.21 | 1081.22 | 0.0 | 1.0 | 0 | 0 | 0.0 | 22 | 2006 |
| 113712 | DC1033648016334634330F6 | 18121 | 2006-06-06 15:08:32.807000000 | HR | Medium Term | Defaulted | 2007-05-16 00:00:00 | 0.24502 | 0.2375 | 0.2325 | ... | -93.79 | 5315.12 | 5340.58 | 0.0 | 1.0 | 0 | 0 | 0.0 | 51 | 2006 |
| 113902 | E6D13366465509973F5E90A | 3579 | 2006-03-26 00:11:04.620000000 | A | Medium Term | Completed | 2006-05-24 00:00:00 | 0.08805 | 0.0812 | 0.0750 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 143 | 2006 |
| 113914 | D20533652658825353F6597 | 30223 | 2006-08-09 14:34:40.010000000 | HR | Medium Term | Defaulted | 2007-03-19 00:00:00 | 0.25757 | 0.2500 | 0.2450 | ... | 0.00 | 2883.01 | 2883.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2006 |
| 113921 | D21A33647075665665BE266 | 4 | 2005-11-09 20:44:28.847000000 | AA | Medium Term | Completed | 2005-11-25 00:00:00 | NaN | 0.0400 | 0.0350 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2005 |
2255 rows × 82 columns
# % of missing values
loan_df.EmploymentStatus.isnull().mean() * 100
1.9791639239228696
Employment status values reveals that the column data type should be categorical with a 2255 sum of missing values representing 1.98% of missing values in the colum, replacing the missing value with the most occured employment status and changing the data type will suffice.
# replace missing value with most occured value
loan_df.EmploymentStatus.fillna('Employed', inplace = True)
# check for missing values
loan_df[loan_df.EmploymentStatus.isnull()]
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | Year |
|---|
0 rows × 82 columns
# change data type to categorical
loan_df.EmploymentStatus = loan_df.EmploymentStatus.astype('category')
# check data type
loan_df.EmploymentStatus.dtype
CategoricalDtype(categories=['Employed', 'Full-time', 'Not available', 'Not employed',
'Other', 'Part-time', 'Retired', 'Self-employed'],
, ordered=False)
# number of days of loan delinquency values
loan_df.LoanCurrentDaysDelinquent.value_counts()
0 94860
121 293
10 192
4 77
11 62
...
948 1
2313 1
2121 1
1479 1
1207 1
Name: LoanCurrentDaysDelinquent, Length: 2411, dtype: int64
# sorted to highest days of delinquency
loan_df.LoanCurrentDaysDelinquent.sort_values(ascending = False, ignore_index= True)
0 2704
1 2703
2 2613
3 2612
4 2599
...
113932 0
113933 0
113934 0
113935 0
113936 0
Name: LoanCurrentDaysDelinquent, Length: 113937, dtype: int64
# stated monthly income
loan_df.StatedMonthlyIncome
0 3083.333333
1 6125.000000
2 2083.333333
3 2875.000000
4 9583.333333
...
113932 4333.333333
113933 8041.666667
113934 2875.000000
113935 3875.000000
113936 4583.333333
Name: StatedMonthlyIncome, Length: 113937, dtype: float64
The stated monthly income is captured in 6 decimal places float data type, converting the datatype from flaot to integer would suffice.
# round up to two decimal
loan_df.StatedMonthlyIncome = loan_df.StatedMonthlyIncome.astype(int)
loan_df.StatedMonthlyIncome.dtype
dtype('int32')
# check the stated income monthly
loan_df.StatedMonthlyIncome
0 3083
1 6125
2 2083
3 2875
4 9583
...
113932 4333
113933 8041
113934 2875
113935 3875
113936 4583
Name: StatedMonthlyIncome, Length: 113937, dtype: int32
# sorted from highest monthly income
loan_df.StatedMonthlyIncome.sort_values(ascending = False, ignore_index= True)
0 1750002
1 618547
2 483333
3 466666
4 416666
...
113932 0
113933 0
113934 0
113935 0
113936 0
Name: StatedMonthlyIncome, Length: 113937, dtype: int32
# sorted from highest loan
loan_df.LoanOriginalAmount.sort_values(ascending = False, ignore_index= True)
0 35000
1 35000
2 35000
3 35000
4 35000
...
113932 1000
113933 1000
113934 1000
113935 1000
113936 1000
Name: LoanOriginalAmount, Length: 113937, dtype: int64
# monthly loan payment data type
loan_df.MonthlyLoanPayment.dtype
dtype('float64')
# convert from float to integer
loan_df.MonthlyLoanPayment = loan_df.MonthlyLoanPayment.astype(int)
# check the data type
loan_df.MonthlyLoanPayment.dtype
dtype('int32')
# sorted from highest monthly loan payment
loan_df.MonthlyLoanPayment.sort_values(ascending = False, ignore_index= True)
0 2251
1 2218
2 2179
3 2163
4 2153
...
113932 0
113933 0
113934 0
113935 0
113936 0
Name: MonthlyLoanPayment, Length: 113937, dtype: int32
# sum of missing values in occupation column
loan_df.Occupation.isnull().sum()
3588
# null occupation records
loan_df[loan_df.Occupation.isnull()]
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34 | 0F6C3603015887476F3F015 | 1180690 | 2014-02-24 12:55:20.430000000 | NaN | Medium Term | Current | NaN | 0.22966 | 0.1920 | 0.1820 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2014 |
| 39 | 109D3366077649289619396 | 24135 | 2006-07-11 17:14:14.647000000 | B | Medium Term | Completed | 2009-10-22 00:00:00 | 0.15211 | 0.1450 | 0.1400 | ... | -299.78 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 44 | 2006 |
| 92 | 0FE433661215082408720F0 | 22591 | 2006-07-02 11:05:02.857000000 | AA | Medium Term | Completed | 2007-03-01 00:00:00 | 0.08292 | 0.0755 | 0.0705 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 21 | 2006 |
| 146 | 34B03365511229087456DB5 | 31745 | 2006-08-16 09:43:10.867000000 | A | Medium Term | Completed | 2006-12-04 00:00:00 | 0.09939 | 0.0925 | 0.0850 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 67 | 2006 |
| 161 | 10453596739685417502B89 | 1086023 | 2013-12-20 14:57:44.580000000 | NaN | Medium Term | Current | NaN | 0.17151 | 0.1355 | 0.1255 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 2 | 2013 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113709 | DC043594246199576475BB4 | 1017315 | 2013-11-05 05:44:05.173000000 | NaN | Medium Term | Current | NaN | 0.33215 | 0.2925 | 0.2825 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2013 |
| 113712 | DC1033648016334634330F6 | 18121 | 2006-06-06 15:08:32.807000000 | HR | Medium Term | Defaulted | 2007-05-16 00:00:00 | 0.24502 | 0.2375 | 0.2325 | ... | -93.79 | 5315.12 | 5340.58 | 0.0 | 1.0 | 0 | 0 | 0.0 | 51 | 2006 |
| 113902 | E6D13366465509973F5E90A | 3579 | 2006-03-26 00:11:04.620000000 | A | Medium Term | Completed | 2006-05-24 00:00:00 | 0.08805 | 0.0812 | 0.0750 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 143 | 2006 |
| 113914 | D20533652658825353F6597 | 30223 | 2006-08-09 14:34:40.010000000 | HR | Medium Term | Defaulted | 2007-03-19 00:00:00 | 0.25757 | 0.2500 | 0.2450 | ... | 0.00 | 2883.01 | 2883.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2006 |
| 113921 | D21A33647075665665BE266 | 4 | 2005-11-09 20:44:28.847000000 | AA | Medium Term | Completed | 2005-11-25 00:00:00 | NaN | 0.0400 | 0.0350 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2005 |
3588 rows × 82 columns
# & of missing values in occupation
loan_df.Occupation.isnull().mean()* 100
3.149108717975723
Occupation values reveals that the column data type should be categorical with a sum of 3588 missing values representing 3.15% of missing values in the column, the null values in occupation column would be replaced with the key word unidentified afterwards the data type would be changed to categorical.
# value count of occupation
loan_df.Occupation.value_counts()
Other 28617
Professional 13628
Computer Programmer 4478
Executive 4311
Teacher 3759
...
Dentist 68
Student - College Freshman 41
Student - Community College 28
Judge 22
Student - Technical School 16
Name: Occupation, Length: 67, dtype: int64
# list of all occuption recorded
list(loan_df.Occupation)
['Other', 'Professional', 'Other', 'Skilled Labor', 'Executive', 'Professional', 'Sales - Retail', 'Laborer', 'Food Service', 'Food Service', 'Fireman', 'Waiter/Waitress', 'Sales - Retail', 'Construction', 'Computer Programmer', 'Other', 'Professional', 'Professional', 'Sales - Commission', 'Laborer', 'Retail Management', 'Professional', 'Other', 'Skilled Labor', 'Other', 'Engineer - Mechanical', 'Sales - Commission', 'Executive', 'Military Enlisted', 'Other', 'Other', 'Other', 'Clerical', 'Other', nan, 'Retail Management', 'Professional', 'Teacher', 'Other', nan, 'Other', 'Other', 'Other', 'Clergy', 'Professional', 'Executive', 'Accountant/CPA', 'Professional', 'Attorney', 'Professional', 'Nurse (RN)', 'Nurse (RN)', 'Other', 'Accountant/CPA', 'Executive', 'Retail Management', 'Retail Management', 'Construction', 'Clergy', 'Other', 'Teacher', 'Analyst', 'Other', 'Analyst', 'Sales - Commission', 'Other', 'Other', 'Professional', "Nurse's Aide", 'Investor', 'Retail Management', 'Other', 'Other', 'Other', 'Fireman', 'Skilled Labor', 'Executive', 'Analyst', 'Realtor', 'Skilled Labor', 'Other', 'Professional', 'Other', 'Accountant/CPA', 'Flight Attendant', 'Other', 'Nurse (LPN)', 'Professional', 'Construction', 'Construction', 'Clerical', 'Other', nan, 'Laborer', 'Military Officer', 'Analyst', 'Clerical', 'Sales - Retail', 'Computer Programmer', 'Other', 'Other', 'Food Service Management', 'Professional', 'Truck Driver', 'Other', 'Sales - Commission', 'Other', 'Administrative Assistant', 'Executive', 'Police Officer/Correction Officer', 'Nurse (RN)', 'Laborer', 'Social Worker', 'Other', 'Computer Programmer', 'Professional', 'Executive', 'Professional', 'Teacher', 'Clerical', 'Laborer', 'Military Officer', 'Other', 'Sales - Retail', 'Construction', 'Other', 'Sales - Commission', 'Administrative Assistant', 'Other', 'Retail Management', 'Food Service Management', 'Other', 'Military Enlisted', 'Other', 'Professional', 'Skilled Labor', 'Sales - Commission', 'Construction', 'Military Enlisted', 'Other', 'Accountant/CPA', 'Truck Driver', 'Analyst', 'Fireman', 'Sales - Commission', 'Tradesman - Mechanic', nan, 'Professional', 'Accountant/CPA', 'Professional', 'Professional', 'Executive', 'Skilled Labor', 'Skilled Labor', 'Teacher', 'Computer Programmer', 'Professional', 'Food Service', 'Tradesman - Mechanic', 'Analyst', 'Military Officer', nan, nan, 'Other', 'Other', 'Other', 'Executive', 'Medical Technician', 'Executive', 'Professor', 'Other', 'Tradesman - Mechanic', 'Other', 'Skilled Labor', 'Teacher', 'Postal Service', 'Sales - Commission', 'Social Worker', 'Other', 'Postal Service', 'Other', 'Sales - Retail', 'Computer Programmer', 'Attorney', 'Military Enlisted', 'Teacher', 'Other', 'Other', 'Other', 'Sales - Commission', 'Professional', 'Analyst', 'Military Enlisted', nan, 'Other', 'Nurse (RN)', 'Other', 'Waiter/Waitress', 'Professional', 'Civil Service', 'Executive', 'Other', 'Medical Technician', 'Other', 'Teacher', 'Other', nan, 'Retail Management', 'Other', 'Accountant/CPA', 'Analyst', 'Other', 'Professional', 'Sales - Commission', 'Professional', 'Realtor', 'Investor', 'Food Service', 'Other', 'Professional', 'Police Officer/Correction Officer', 'Other', 'Pharmacist', 'Administrative Assistant', 'Sales - Commission', 'Nurse (RN)', 'Other', 'Professional', 'Truck Driver', nan, 'Other', 'Skilled Labor', 'Other', 'Other', 'Nurse (RN)', 'Truck Driver', 'Tradesman - Mechanic', nan, 'Other', 'Professional', 'Laborer', 'Administrative Assistant', 'Professional', 'Construction', 'Truck Driver', 'Computer Programmer', 'Pharmacist', nan, 'Administrative Assistant', 'Teacher', nan, 'Other', 'Clerical', 'Other', 'Other', 'Computer Programmer', 'Computer Programmer', 'Clerical', 'Administrative Assistant', 'Sales - Retail', 'Professional', 'Other', 'Police Officer/Correction Officer', 'Professional', 'Food Service Management', 'Retail Management', 'Other', 'Other', 'Other', 'Other', 'Other', 'Computer Programmer', 'Tradesman - Electrician', 'Other', 'Computer Programmer', 'Professional', 'Other', 'Other', 'Truck Driver', 'Scientist', 'Engineer - Mechanical', 'Skilled Labor', 'Other', 'Teacher', 'Accountant/CPA', 'Retail Management', 'Accountant/CPA', 'Dentist', 'Engineer - Electrical', 'Other', 'Skilled Labor', 'Sales - Commission', 'Professional', 'Clerical', 'Retail Management', 'Other', 'Administrative Assistant', 'Other', 'Retail Management', 'Teacher', 'Computer Programmer', 'Civil Service', 'Sales - Retail', 'Architect', 'Professional', 'Other', 'Other', 'Professional', 'Landscaping', 'Analyst', nan, 'Nurse (RN)', 'Administrative Assistant', 'Professional', 'Executive', 'Landscaping', 'Administrative Assistant', 'Professional', 'Clerical', 'Executive', 'Construction', 'Professional', 'Analyst', 'Executive', 'Laborer', nan, 'Sales - Retail', 'Professional', 'Nurse (LPN)', 'Waiter/Waitress', 'Executive', 'Other', nan, 'Flight Attendant', 'Executive', 'Other', 'Professional', 'Retail Management', 'Tradesman - Carpenter', 'Other', 'Waiter/Waitress', 'Administrative Assistant', 'Other', 'Tradesman - Electrician', 'Medical Technician', 'Other', 'Retail Management', 'Truck Driver', 'Analyst', nan, 'Skilled Labor', 'Sales - Commission', 'Executive', 'Professional', 'Truck Driver', 'Police Officer/Correction Officer', nan, 'Civil Service', 'Professional', 'Other', 'Professional', 'Clerical', 'Accountant/CPA', 'Executive', 'Bus Driver', 'Professional', 'Sales - Commission', 'Sales - Retail', 'Sales - Retail', 'Food Service', 'Skilled Labor', 'Computer Programmer', 'Postal Service', 'Other', 'Executive', 'Other', 'Nurse (LPN)', 'Analyst', 'Medical Technician', 'Other', 'Professional', 'Other', 'Engineer - Mechanical', 'Teacher', 'Tradesman - Plumber', 'Retail Management', 'Professional', nan, 'Construction', 'Clerical', 'Professional', 'Other', 'Other', 'Professional', 'Computer Programmer', 'Other', 'Tradesman - Electrician', 'Professional', 'Professional', 'Administrative Assistant', 'Executive', 'Computer Programmer', 'Other', 'Other', 'Bus Driver', 'Teacher', 'Architect', 'Computer Programmer', 'Executive', 'Other', 'Other', 'Engineer - Chemical', 'Computer Programmer', 'Other', 'Computer Programmer', 'Professional', 'Food Service', 'Other', 'Doctor', 'Other', 'Construction', 'Other', 'Analyst', 'Executive', 'Other', 'Administrative Assistant', 'Chemist', 'Student - College Senior', 'Other', 'Principal', 'Clerical', 'Laborer', 'Other', 'Other', 'Civil Service', 'Skilled Labor', 'Executive', 'Professional', 'Attorney', 'Professional', 'Professional', 'Other', 'Other', 'Sales - Commission', 'Other', 'Accountant/CPA', 'Social Worker', 'Sales - Retail', nan, 'Truck Driver', 'Skilled Labor', 'Administrative Assistant', 'Scientist', 'Medical Technician', 'Computer Programmer', 'Sales - Commission', 'Computer Programmer', 'Military Enlisted', 'Engineer - Mechanical', 'Other', 'Doctor', 'Civil Service', 'Skilled Labor', 'Doctor', 'Sales - Retail', "Teacher's Aide", 'Police Officer/Correction Officer', 'Skilled Labor', 'Other', 'Professional', 'Professional', 'Executive', nan, 'Landscaping', 'Other', 'Other', 'Other', 'Other', 'Teacher', 'Other', 'Waiter/Waitress', 'Other', 'Administrative Assistant', 'Other', 'Professional', 'Military Enlisted', 'Nurse (RN)', 'Analyst', 'Skilled Labor', 'Other', 'Other', 'Analyst', 'Pilot - Private/Commercial', 'Other', 'Other', 'Other', 'Executive', 'Other', 'Professional', 'Engineer - Mechanical', 'Other', 'Sales - Commission', 'Engineer - Electrical', 'Skilled Labor', 'Computer Programmer', 'Other', 'Other', 'Nurse (RN)', "Nurse's Aide", 'Other', 'Retail Management', 'Other', 'Pharmacist', 'Scientist', 'Professional', 'Analyst', 'Military Enlisted', "Teacher's Aide", 'Other', 'Other', 'Sales - Retail', 'Doctor', 'Teacher', 'Professional', 'Skilled Labor', 'Other', 'Accountant/CPA', 'Professional', 'Food Service', 'Food Service', 'Sales - Retail', 'Executive', nan, 'Engineer - Electrical', 'Professional', 'Other', 'Other', 'Professional', 'Other', 'Analyst', 'Truck Driver', 'Other', 'Other', 'Truck Driver', 'Sales - Retail', 'Other', 'Analyst', 'Professional', 'Religious', 'Computer Programmer', 'Other', 'Truck Driver', 'Other', 'Administrative Assistant', 'Other', 'Analyst', nan, 'Other', 'Other', 'Retail Management', 'Food Service', 'Other', 'Other', 'Executive', 'Other', 'Homemaker', 'Engineer - Electrical', 'Engineer - Mechanical', 'Executive', 'Police Officer/Correction Officer', 'Food Service Management', 'Retail Management', 'Professional', 'Executive', nan, 'Truck Driver', 'Homemaker', 'Professional', 'Professional', 'Other', 'Food Service', nan, 'Retail Management', 'Nurse (RN)', 'Analyst', nan, 'Accountant/CPA', 'Other', 'Clerical', 'Professional', nan, 'Analyst', 'Professional', 'Nurse (RN)', 'Analyst', 'Sales - Commission', 'Nurse (RN)', 'Retail Management', 'Computer Programmer', 'Scientist', 'Executive', 'Professional', 'Sales - Commission', 'Other', 'Professional', 'Other', 'Analyst', 'Accountant/CPA', 'Attorney', 'Analyst', 'Other', 'Executive', 'Administrative Assistant', 'Other', 'Retail Management', 'Truck Driver', 'Nurse (RN)', 'Truck Driver', 'Other', 'Tradesman - Mechanic', 'Sales - Retail', 'Executive', 'Tradesman - Electrician', 'Other', 'Clergy', 'Accountant/CPA', 'Other', 'Other', 'Other', 'Laborer', 'Other', 'Other', 'Other', 'Analyst', 'Other', 'Realtor', 'Teacher', 'Doctor', 'Engineer - Electrical', 'Social Worker', 'Professional', 'Teacher', 'Engineer - Mechanical', 'Executive', 'Sales - Retail', 'Professional', 'Executive', 'Professional', 'Nurse (RN)', nan, 'Computer Programmer', 'Professional', 'Fireman', 'Other', nan, 'Student - College Graduate Student', 'Other', 'Religious', 'Analyst', 'Sales - Retail', 'Military Enlisted', 'Professional', 'Professional', 'Accountant/CPA', 'Civil Service', 'Investor', 'Other', 'Realtor', 'Professional', 'Professional', 'Accountant/CPA', 'Truck Driver', 'Truck Driver', 'Other', 'Military Enlisted', 'Skilled Labor', 'Professor', 'Accountant/CPA', 'Attorney', 'Other', 'Other', 'Attorney', 'Other', 'Tradesman - Electrician', 'Truck Driver', 'Skilled Labor', 'Other', 'Pilot - Private/Commercial', 'Other', 'Other', 'Other', 'Professional', 'Computer Programmer', 'Medical Technician', 'Professional', 'Professional', 'Other', 'Computer Programmer', nan, 'Other', 'Realtor', 'Professional', 'Analyst', 'Administrative Assistant', "Nurse's Aide", 'Other', 'Police Officer/Correction Officer', 'Executive', 'Laborer', 'Professor', 'Professional', 'Professional', 'Professional', 'Other', 'Teacher', 'Tradesman - Electrician', 'Other', 'Tradesman - Mechanic', 'Police Officer/Correction Officer', 'Analyst', 'Other', 'Professional', 'Tradesman - Mechanic', 'Other', 'Computer Programmer', 'Food Service Management', 'Police Officer/Correction Officer', 'Accountant/CPA', 'Professional', 'Other', 'Other', 'Analyst', 'Other', 'Other', 'Accountant/CPA', 'Architect', 'Other', 'Other', 'Other', 'Other', 'Teacher', 'Other', "Nurse's Aide", 'Computer Programmer', "Nurse's Aide", 'Analyst', 'Police Officer/Correction Officer', 'Tradesman - Plumber', 'Professional', 'Other', 'Professor', 'Professional', 'Truck Driver', 'Sales - Commission', 'Professional', 'Other', 'Other', 'Doctor', 'Sales - Commission', 'Administrative Assistant', 'Professional', 'Professional', 'Student - College Senior', 'Teacher', 'Retail Management', 'Computer Programmer', 'Executive', 'Other', 'Police Officer/Correction Officer', 'Executive', 'Administrative Assistant', 'Nurse (RN)', 'Other', 'Other', 'Other', 'Professional', 'Engineer - Electrical', 'Other', 'Sales - Retail', 'Nurse (RN)', 'Teacher', 'Clerical', 'Sales - Retail', 'Clerical', 'Tradesman - Mechanic', 'Civil Service', 'Other', 'Other', 'Other', 'Engineer - Mechanical', 'Teacher', 'Teacher', 'Professional', 'Tradesman - Mechanic', 'Other', 'Nurse (RN)', 'Other', 'Retail Management', 'Police Officer/Correction Officer', 'Accountant/CPA', 'Other', 'Sales - Retail', 'Student - Technical School', 'Other', 'Other', 'Food Service Management', 'Professional', "Nurse's Aide", 'Teacher', 'Professional', 'Retail Management', 'Other', 'Other', 'Other', 'Other', 'Professional', 'Sales - Commission', 'Executive', 'Other', 'Truck Driver', 'Other', 'Pilot - Private/Commercial', 'Administrative Assistant', 'Sales - Retail', 'Executive', 'Other', 'Other', 'Other', 'Computer Programmer', 'Other', 'Executive', 'Other', nan, 'Other', 'Other', 'Nurse (LPN)', nan, 'Accountant/CPA', 'Police Officer/Correction Officer', 'Professional', 'Engineer - Mechanical', 'Investor', nan, 'Skilled Labor', 'Military Officer', 'Sales - Commission', 'Social Worker', 'Executive', 'Professional', 'Student - College Senior', 'Professional', 'Laborer', 'Professional', nan, 'Professional', 'Retail Management', 'Other', 'Skilled Labor', 'Teacher', 'Executive', 'Administrative Assistant', 'Engineer - Mechanical', 'Other', 'Other', 'Other', 'Computer Programmer', 'Nurse (RN)', 'Administrative Assistant', 'Other', 'Executive', 'Construction', 'Principal', 'Administrative Assistant', 'Tradesman - Mechanic', 'Other', 'Professional', 'Police Officer/Correction Officer', 'Analyst', 'Other', 'Teacher', 'Fireman', 'Other', 'Professional', 'Executive', 'Teacher', 'Skilled Labor', 'Construction', nan, 'Police Officer/Correction Officer', 'Postal Service', 'Other', 'Professional', 'Other', 'Nurse (LPN)', 'Construction', 'Food Service', 'Police Officer/Correction Officer', 'Other', 'Professional', 'Accountant/CPA', 'Executive', 'Professional', 'Civil Service', 'Construction', 'Analyst', nan, 'Teacher', nan, 'Other', 'Other', 'Sales - Commission', 'Other', 'Clerical', 'Computer Programmer', 'Truck Driver', 'Medical Technician', 'Nurse (RN)', 'Other', 'Other', 'Medical Technician', 'Professional', 'Other', 'Other', 'Accountant/CPA', 'Laborer', 'Other', 'Sales - Commission', 'Other', 'Other', 'Teacher', 'Nurse (RN)', 'Retail Management', 'Administrative Assistant', 'Investor', 'Professional', 'Other', 'Other', 'Computer Programmer', 'Laborer', 'Laborer', 'Teacher', 'Clerical', 'Psychologist', 'Professional', 'Other', 'Retail Management', 'Professional', 'Truck Driver', 'Homemaker', 'Retail Management', 'Food Service Management', 'Other', 'Professional', 'Sales - Retail', 'Principal', 'Other', 'Sales - Retail', 'Computer Programmer', 'Realtor', 'Administrative Assistant', 'Other', 'Sales - Retail', 'Sales - Commission', 'Retail Management', 'Medical Technician', 'Accountant/CPA', 'Computer Programmer', 'Teacher', 'Tradesman - Mechanic', 'Social Worker', 'Retail Management', 'Engineer - Mechanical', 'Engineer - Mechanical', 'Other', 'Retail Management', nan, 'Police Officer/Correction Officer', 'Professional', 'Other', 'Skilled Labor', 'Other', 'Skilled Labor', 'Computer Programmer', 'Fireman', 'Clerical', 'Professional', 'Computer Programmer', 'Civil Service', 'Other', 'Other', 'Professional', 'Computer Programmer', 'Biologist', 'Truck Driver', 'Executive', 'Medical Technician', 'Accountant/CPA', 'Postal Service', 'Other', 'Other', ...]
# check data type
loan_df.Occupation.dtype
dtype('O')
# replace null with unindentified
loan_df.Occupation.fillna('Unidentified', inplace = True)
# check for null records
loan_df[loan_df.Occupation.isnull()]
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | Year |
|---|
0 rows × 82 columns
# convert to category data type
loan_df.Occupation = loan_df.Occupation.astype('category')
# check data type
loan_df.Occupation.dtype
CategoricalDtype(categories=['Accountant/CPA', 'Administrative Assistant', 'Analyst',
'Architect', 'Attorney', 'Biologist', 'Bus Driver',
'Car Dealer', 'Chemist', 'Civil Service', 'Clergy',
'Clerical', 'Computer Programmer', 'Construction', 'Dentist',
'Doctor', 'Engineer - Chemical', 'Engineer - Electrical',
'Engineer - Mechanical', 'Executive', 'Fireman',
'Flight Attendant', 'Food Service',
'Food Service Management', 'Homemaker', 'Investor', 'Judge',
'Laborer', 'Landscaping', 'Medical Technician',
'Military Enlisted', 'Military Officer', 'Nurse (LPN)',
'Nurse (RN)', 'Nurse's Aide', 'Other', 'Pharmacist',
'Pilot - Private/Commercial',
'Police Officer/Correction Officer', 'Postal Service',
'Principal', 'Professional', 'Professor', 'Psychologist',
'Realtor', 'Religious', 'Retail Management',
'Sales - Commission', 'Sales - Retail', 'Scientist',
'Skilled Labor', 'Social Worker',
'Student - College Freshman',
'Student - College Graduate Student',
'Student - College Junior', 'Student - College Senior',
'Student - College Sophomore', 'Student - Community College',
'Student - Technical School', 'Teacher', 'Teacher's Aide',
'Tradesman - Carpenter', 'Tradesman - Electrician',
'Tradesman - Mechanic', 'Tradesman - Plumber',
'Truck Driver', 'Unidentified', 'Waiter/Waitress'],
, ordered=False)
# number of investor value count
loan_df.Investors.value_counts()
1 27814
2 1386
3 991
4 827
5 753
...
912 1
815 1
751 1
623 1
831 1
Name: Investors, Length: 751, dtype: int64
# sorted from highest number of investor
loan_df.Investors.sort_values(ascending=False, ignore_index= True)
0 1189
1 1035
2 1024
3 1011
4 917
...
113932 1
113933 1
113934 1
113935 1
113936 1
Name: Investors, Length: 113937, dtype: int64
# borrower state value count
loan_df.BorrowerState.value_counts()
CA 14717 TX 6842 NY 6729 FL 6720 IL 5921 GA 5008 OH 4197 MI 3593 VA 3278 NJ 3097 NC 3084 WA 3048 PA 2972 MD 2821 MO 2615 MN 2318 MA 2242 CO 2210 IN 2078 AZ 1901 WI 1842 OR 1817 TN 1737 AL 1679 CT 1627 SC 1122 NV 1090 KS 1062 KY 983 OK 971 LA 954 UT 877 AR 855 MS 787 NE 674 ID 599 NH 551 NM 472 RI 435 HI 409 WV 391 DC 382 MT 330 DE 300 VT 207 AK 200 SD 189 IA 186 WY 150 ME 101 ND 52 Name: BorrowerState, dtype: int64
# borrower state sum of null values
loan_df.BorrowerState.isnull().sum()
5515
# check data type
loan_df.BorrowerState.dtype
dtype('O')
# borrower state % of null values
loan_df.BorrowerState.isnull().mean() * 100
4.840394252964358
The borrower state values reveals that the column data type should be categorical with a sum of 5515 missing values representing 4.84% of missing values in the column, replacing the missing value with a unique key word UNKNOWN and changing the data type will suffice.
# fill missing state with unknown
loan_df.BorrowerState.fillna('Unknown', inplace = True)
# convert data type to categorical
loan_df.BorrowerState = loan_df.BorrowerState.astype('category')
# check data type
loan_df.BorrowerState.dtype
CategoricalDtype(categories=['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL',
'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA',
'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE',
'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI',
'SC', 'SD', 'TN', 'TX', 'UT', 'Unknown', 'VA', 'VT', 'WA',
'WI', 'WV', 'WY'],
, ordered=False)
# creating a state dictionary to repalce the abbrev state
state = {'WA': 'WASHINGTON', 'VA': 'VIRGINIA', 'DE': 'DELAWARE',
'DC': 'DISTRICT OF COLUMBIA', 'WI': 'WISCONSIN', 'WV': 'WEST VIRGINIA',
'HI': 'HAWAII', 'FL': 'FLORIDA', 'WY': 'WYOMING',
'NH': 'NEW HAMPSHIRE', 'NJ': 'NEW JERSEY', 'NM': 'NEW MEXICO',
'TX': 'TEXAS', 'LA': 'LOUISIANA', 'NC': 'NORTH CAROLINA',
'ND': 'NORTH DAKOTA', 'NE': 'NEBRASKA', 'TN': 'TENNESSEE',
'NY': 'NEW YORK', 'PA': 'PENNSYLVANIA', 'CA': 'CALIFORNIA',
'NV': 'NEVADA', 'CO': 'COLORADO', 'VI': 'VIRGIN ISLANDS',
'AK': 'ALASKA', 'AL': 'ALABAMA', 'AR': 'ARKANSAS',
'VT': 'VERMONT', 'IL': 'ILLINOIS', 'GA': 'GEORGIA',
'IN': 'INDIANA', 'IA': 'IOWA', 'OK': 'OKLAHOMA',
'AZ': 'ARIZONA', 'ID': 'IDAHO', 'CT': 'CONNECTICUT',
'ME': 'MAINE', 'MD': 'MARYLAND', 'MA': 'MASSACHUSETTS',
'OH': 'OHIO', 'UT': 'UTAH', 'MO': 'MISSOURI',
'MN': 'MINNESOTA', 'MI': 'MICHIGAN', 'RI': 'RHODE ISLAND',
'KS': 'KANSAS', 'MT': 'MONTANA', 'MS': 'MISSISSIPPI',
'SC': 'SOUTH CAROLINA', 'KY': 'KENTUCKY', 'OR': 'OREGON',
'SD': 'SOUTH DAKOTA', 'Unknown': 'UNKNOWN' }
# repalace the abbrev with the dict
loan_df.BorrowerState = loan_df.BorrowerState.map(state)
# check the values
loan_df.BorrowerState.value_counts()
CALIFORNIA 14717 TEXAS 6842 NEW YORK 6729 FLORIDA 6720 ILLINOIS 5921 UNKNOWN 5515 GEORGIA 5008 OHIO 4197 MICHIGAN 3593 VIRGINIA 3278 NEW JERSEY 3097 NORTH CAROLINA 3084 WASHINGTON 3048 PENNSYLVANIA 2972 MARYLAND 2821 MISSOURI 2615 MINNESOTA 2318 MASSACHUSETTS 2242 COLORADO 2210 INDIANA 2078 ARIZONA 1901 WISCONSIN 1842 OREGON 1817 TENNESSEE 1737 ALABAMA 1679 CONNECTICUT 1627 SOUTH CAROLINA 1122 NEVADA 1090 KANSAS 1062 KENTUCKY 983 OKLAHOMA 971 LOUISIANA 954 UTAH 877 ARKANSAS 855 MISSISSIPPI 787 NEBRASKA 674 IDAHO 599 NEW HAMPSHIRE 551 NEW MEXICO 472 RHODE ISLAND 435 HAWAII 409 WEST VIRGINIA 391 DISTRICT OF COLUMBIA 382 MONTANA 330 DELAWARE 300 VERMONT 207 ALASKA 200 SOUTH DAKOTA 189 IOWA 186 WYOMING 150 MAINE 101 NORTH DAKOTA 52 Name: BorrowerState, dtype: int64
# structure of the dataframe
loan_df.shape
(113937, 82)
The structure of the data set is 113,937 rows and 81 columns, implying 113,937 recorded observations with 81 features. The main features of interest to this study include but not limited to the following; loan status, loan term, employment Status, is borrower a homeowner or not, borrower state, income verifiable or not and occupation. To get a better understanding of how this features of interest would be investigated a number of features would support this study which include the following features original loan amount, monthly loan payment, loan current days of delinquency, stated monthly income, investors and recommendations.
In spite of the fact that the dataframe has 81 features, this study is only interested in few of the features, it would be appropriate to shrink the dataframe to the useful columns for the purpose of this study. The following line of code would be executed to extract the necessary column needed for analysis.
# select the column into new df
shrink_loan_df = loan_df[['LoanOriginationDate','Term', 'Year', 'LoanStatus', 'BorrowerState', 'Occupation',
'EmploymentStatus', 'IsBorrowerHomeowner', 'LoanCurrentDaysDelinquent',
'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanOriginalAmount',
'MonthlyLoanPayment','Recommendations', 'Investors']]
# save the shrinked df as a csv file
shrink_loan_df.to_csv('shrink_loan_df.csv', index = False)
# check info of df
shrink_loan_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LoanOriginationDate 113937 non-null datetime64[ns] 1 Term 113937 non-null category 2 Year 113937 non-null category 3 LoanStatus 113937 non-null category 4 BorrowerState 113937 non-null category 5 Occupation 113937 non-null category 6 EmploymentStatus 113937 non-null category 7 IsBorrowerHomeowner 113937 non-null bool 8 LoanCurrentDaysDelinquent 113937 non-null int64 9 IncomeVerifiable 113937 non-null bool 10 StatedMonthlyIncome 113937 non-null int32 11 LoanOriginalAmount 113937 non-null int64 12 MonthlyLoanPayment 113937 non-null int32 13 Recommendations 113937 non-null int64 14 Investors 113937 non-null int64 dtypes: bool(2), category(6), datetime64[ns](1), int32(2), int64(4) memory usage: 6.1 MB
# check null values
shrink_loan_df.isnull().sum()
LoanOriginationDate 0 Term 0 Year 0 LoanStatus 0 BorrowerState 0 Occupation 0 EmploymentStatus 0 IsBorrowerHomeowner 0 LoanCurrentDaysDelinquent 0 IncomeVerifiable 0 StatedMonthlyIncome 0 LoanOriginalAmount 0 MonthlyLoanPayment 0 Recommendations 0 Investors 0 dtype: int64
# more stat info
shrink_loan_df.describe()
| LoanCurrentDaysDelinquent | StatedMonthlyIncome | LoanOriginalAmount | MonthlyLoanPayment | Recommendations | Investors | |
|---|---|---|---|---|---|---|
| count | 113937.000000 | 1.139370e+05 | 113937.00000 | 113937.000000 | 113937.000000 | 113937.000000 |
| mean | 152.816539 | 5.607707e+03 | 8337.01385 | 271.978128 | 0.048027 | 80.475228 |
| std | 466.320254 | 7.478491e+03 | 6245.80058 | 192.701916 | 0.332353 | 103.239020 |
| min | 0.000000 | 0.000000e+00 | 1000.00000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 0.000000 | 3.200000e+03 | 4000.00000 | 131.000000 | 0.000000 | 2.000000 |
| 50% | 0.000000 | 4.666000e+03 | 6500.00000 | 217.000000 | 0.000000 | 44.000000 |
| 75% | 0.000000 | 6.825000e+03 | 12000.00000 | 371.000000 | 0.000000 | 115.000000 |
| max | 2704.000000 | 1.750002e+06 | 35000.00000 | 2251.000000 | 39.000000 | 1189.000000 |
# check df
shrink_loan_df.tail()
| LoanOriginationDate | Term | Year | LoanStatus | BorrowerState | Occupation | EmploymentStatus | IsBorrowerHomeowner | LoanCurrentDaysDelinquent | IncomeVerifiable | StatedMonthlyIncome | LoanOriginalAmount | MonthlyLoanPayment | Recommendations | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 113932 | 2013-04-22 | Medium Term | 2013 | Current | ILLINOIS | Food Service Management | Employed | True | 0 | True | 4333 | 10000 | 364 | 0 | 1 |
| 113933 | 2011-11-07 | Medium Term | 2011 | FinalPaymentInProgress | PENNSYLVANIA | Professional | Employed | True | 0 | True | 8041 | 2000 | 65 | 0 | 22 |
| 113934 | 2013-12-23 | Long Term | 2013 | Current | TEXAS | Other | Employed | True | 0 | True | 2875 | 10000 | 273 | 0 | 119 |
| 113935 | 2011-11-21 | Long Term | 2011 | Completed | GEORGIA | Food Service | Full-time | True | 0 | True | 3875 | 15000 | 449 | 0 | 274 |
| 113936 | 2014-01-21 | Medium Term | 2014 | Current | NEW YORK | Professor | Employed | False | 0 | True | 4583 | 2000 | 64 | 0 | 1 |
# define a func to plot hist
def draw_hist(x, title):
"""plot histogram to show
dist of numeric variable"""
"""param: x, title"""
"""return none"""
plt.figure(figsize=(10,8), dpi = 400)
plt.hist(x = x)
plt.title(title)
plt.xlabel('Amount (Dollars)', fontsize = 10)
plt.ylabel('Distribution', fontsize = 10)
# call func to plot hist of loan original amount
draw_hist(shrink_loan_df.LoanOriginalAmount, 'Histogram Distibution of Loan Original Amount')
The data in the above graph are right-skewed, a case of symmetrical distribution. Most of the loan original amount are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 5000 dollars, the isolated bar indicates an outliers in the ranges between 32000 dollars and 35000 dollars. The data spread is from about 1000 dollars to 3500 dollars.
# call func to plot hist of monthly loan payment
draw_hist(shrink_loan_df.MonthlyLoanPayment, 'Histogram Distibution of Monthly Loan Payment')
The data in the above graph are right-skewed, a case of symmetrical distribution. Most of the monthly loan payment are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 173 dollars. The data spread is from about zero dollars to 2251 dollars.
# define a func to plot kernel
# density estimate
def draw_kde(x, title):
"""plot a kernel density estimate"""
"""param: x, title"""
"""return none"""
plt.figure(figsize=(10,8), dpi = 400)
sns.kdeplot(x = x, data = shrink_loan_df, fill = True)
plt.xlabel('Amount(Dollars)')
plt.title(title)
# plot kernel density estimate for loan original amount
draw_kde('LoanOriginalAmount', 'Kernel Density Estimate for Loan Original Amount')
For example, the probability that a randomly chosen loan original amount will fall between 5000 dollars and 12000 dollars can be calculated as the area between the density function (graph) and the x-axis in the interval [5000, 12000].
# plot kernel density estimate for monthly loan payment
draw_kde('MonthlyLoanPayment', 'Kernel Density Estimate for Monthly Loan Payment')
For example, the probability that a randomly chosen monthly lona payment will fall between 300 dollars and 500 dollars can be calculated as the area between the density function (graph) and the x-axis in the interval [300, 500].
# a func to plot a univariate countplot
def draw_univariate_count_plot(x, title):
"""plot a countplot"""
"""param: x, title"""
"""return none"""
plt.figure(figsize=(8,6), dpi = 400)
ax = sns.countplot(x = x, data = shrink_loan_df)
for p in ax.patches:
ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0, p.get_height()+0.05))
plt.suptitle(title)
plt.xticks(rotation = 90)
plt.plot()
plt.show();
# call func to draw countplot
# of the feature term
draw_univariate_count_plot('Term', 'Term of Loan Distribution.')
Loans disbursed on the medium term in this case 36months has the highest occurence with a count of 87778 representing about 77 percent of loan term duration, leaving the other 23 percent distributed between the long term (60 months) and short term (12months)loan duration.
# call func to draw countplot
# of the feature Employment Status
draw_univariate_count_plot('EmploymentStatus', "Borrower's Employment Status Distribution.")
Those who are employed has the highest occurence in the employment status category with a count of 69557, those who are retired got the lowest occurence in the employment status category, it's more likely to disburse a laon to working class compare to a retired individual.
# call func to draw countplot
# of the feature term
draw_univariate_count_plot('Year', 'Loan Distribution by Year.')
The year 2013 had the highest number of loan disbursment with a occurence of 34345, followed by the year 2012 and 2014 respectively at second and third position, the least loan disbursement occured in the year 2005 with a occurence of 22 loan disbursement.
# call func to draw countplot
# of the feature loan status
draw_univariate_count_plot('LoanStatus', 'Loan Status Distribution.')
The loan status data distribution further indicates that few borrowewrs cancelled their loans and that majority of the borrowers has a current an serviceable loan, those who defaulted in servicing their loan are about 10 percent of those who have current and servicable laon. Those borrowers who are past due has one or more due day.
# call func to draw countplot
# of the feature Borrower's Home Owner Distribution.
draw_univariate_count_plot('IsBorrowerHomeowner', "Borrower's Home Owner Distribution.")
Those borrowers who own a home are slightly above those who do not own a home, the difference here is not quite significant.
# call func to draw countplot
# of the feature Borrower's Verifiable Income Distribution.
draw_univariate_count_plot('IncomeVerifiable', "Borrower's Verifiable Income Distribution.")
Those borrowers whose income is verifiable has the highest occurence, this is quite understandable in this aspect as it is more likely to obtain a loan if the source of income is well define and verifiable, so to speak the probabability of securing a loan if high with a verfiable income.
# import wordcloud
from wordcloud import WordCloud
# each item in borrower state, occupation,
# and employement status column as text to be generated
state_text = " ".join(i for i in shrink_loan_df['BorrowerState'])
occupation_text = " ".join(i for i in shrink_loan_df['Occupation'])
# plot a func to plot wordcloud for
# borrower state, and occupation
def draw_word_cloud(text, title):
"""Plot a word cloud """
"""param: text, title"""
"""return none"""
wordcloud = WordCloud(background_color="white", max_font_size=300, width=1024, height=1000, colormap="magma").generate_from_text(text)
plt.figure(figsize=(20,20))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.suptitle(title, fontsize=30)
plt.show();
# call func to plot borrower state word cloud
draw_word_cloud(state_text, "Borrower's State Word Cloud.")
The image above displays the state where the borrower originates from, with california with a very large fonts indicates how frequent the state is relative to other states. Wyoming and Maine ranked among the least state the borrower originates from, the null values in the state feature was replaced with unknown as we can visualize from the image above.
# call func to plot occupation word cloud
draw_word_cloud(occupation_text, "Borrower's Occupation Word Cloud.")
The image above displays the occupation of the borrowers, with professional and computer programmer with a very large fonts indicates how frequent the occupations are relative to other occupations. community college and labor retail ranked among the least occurence of occupation of the borrower.
# func to plot a boxplot to show distribution
# of numerical variable
def draw_univariate_boxplot(x, title):
"""plot a univariate boxplot"""
"""param:x, title"""
"""return none"""
plt.figure(figsize=(8,6),dpi = 400)
sns.boxplot(x = x, data = shrink_loan_df)
plt.suptitle(title)
plt.show();
# call func to plot stated monthly income dist
draw_univariate_boxplot('StatedMonthlyIncome', 'Stated Monthly Income Distribution.')
The figure above depicts the distribution of the stated monthly income with outliers indicated with an astericks, due to the nature of the data the first quartile and the rest cannot be visualized from this figure.
# call func to plot loan original amount dist
draw_univariate_boxplot('LoanOriginalAmount', 'Loan Original Amount Distribution.')
The figure above depicts the distribution of the loan original amount with skewness to the right and with outliers indicated with an astericks out of the whiskers. The 25th percentile of the loan original amount variable falls slightly before 5000 dollars, while the median falls above the 5000 dollars, the 75th percentile falls short of 15000 dollars.
# call func to plot monthly loan payment dist
draw_univariate_boxplot('MonthlyLoanPayment', 'Monthly Loan Payment Distribution.')
The figure above depicts the distribution of the monthly loan payment with skewness to the right and with outliers indicated with an astericks out of the whiskers. The 25th percentile, median and 75th percentile of the monthly loan payment variable falls short of 500 dollars.
# call func to plot invetors dist
draw_univariate_boxplot('Investors', 'Investors Distribution.')
The figure above depicts the distribution of the investors with skewness to the right and outliers indicated with an astericks out of the whiskers. The 25th percentile, median and 75th percentile of the investors variable falls short of 200 investors.
Summary¶
To analyse the loan with respect the year, the loan origination date column was converted from object datatype to datetime, afterwards the year was extracted from the datetime before setting the data type of the extracted year column as categorical variable, also the loan term values was trasform from the original values 12months, 36months, and 60months to short term, medium term and long term respectively to make for a better behavior as a categorical variable. The loan status has values respresenting past due in a number of categories of days, these values were replaced with a single value named 'past due' regardless of the number of days. The borrower state values were transformed from state abbrevation to full text without leaving out the stated monthy income and monthy loan payment variable out from transformation, these variables were converted from float to integer for consistency with the loan amount data type. The occupation column was transformed from object data type to categorical data type.
For the distribution of term, loans disbursed on the medium term in this case 36months has the highest occurence with a count of 87778 representing about 77 percent of loan term duration, leaving the other 23 percent distributed between the long term (60 months) and short term (12months)loan duration. the loan has distribution over 10 years between 2005 t0 2014, the year 2013 had the highest number of loan disbursement with a occurence of 34345, followed by the year 2012 and 2014 respectively at second and third position, the least loan disbursement occured in the year 2005 with a occurence of 22 loan disbursement. The distibution of the loan original amount is right-skewed, a case of symmetrical distribution. Most of the loan original amount are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 5000 dollars, there are outliers in the ranges of 32000 dollars and 35000 dollars. The distribution of monthly loan payment is right-skewed, a case of symmetrical distribution. Most of the monthly loan payment are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 173 dollars.
# a func to plot a bivariate countplot
def draw_bivariate_count_plot(x, hue, title):
"""plot a countplot"""
"""param: x, hue title"""
"""return none"""
plt.figure(figsize=(8,6), dpi = 400)
ax = sns.countplot(x = x, hue = hue, data = shrink_loan_df)
for p in ax.patches:
ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.15, p.get_height()+0.05))
plt.suptitle(title)
plt.xticks(rotation = 90)
ax.margins(y=0.1)
plt.xlabel('Loan Status', fontsize = 15)
plt.tight_layout()
plt.plot()
plt.show();
# call func to plot laon status by term
draw_bivariate_count_plot('LoanStatus', 'Term', 'Loan Status by Term Duration.')
Apparently the medium term has the highest occurence of loan duration in the distribution from earlier findings, how much does this spread over the loan status category is what the figure above depicts, for the current loan status 20127 were on long term, 36387 were on medium term while just 62 were on the short term. The past due category has a spread of 684 long term duration, 1373 medium term duration, and 10 short term duration. other categories can be visualized from the figure above.
# call func to plot laon status by home ownership
draw_bivariate_count_plot('LoanStatus', 'IsBorrowerHomeowner', 'Loan Status by Borrower Home Ownership.')
The home ownership variable has the true and false values evenly distributed from earlier findings, how much does this spread over the loan status category is what the figure above depicts, for the current loan status 30478 were true(home owner), while 26098 were false(no home ownership). The past due category has a spread of 1003 home ownership, and 1064 no home ownership. The completed loan status category has home ownership of 18280 while 19794 does not own a home.
# call func to plot laon status by verifiable income
draw_bivariate_count_plot('LoanStatus', 'IncomeVerifiable', 'Loan Status by Verifiable Income')
The verifiable income variable has the true value as the highest occurence in the distribution from earlier findings, how much does this spread over the loan status category is what the figure above depicts, for the current loan status 52434 has true (verifiable income) , while 4142 were false(no verifiable income). The past due category has a spread of 1857 verifiable income borrowers, and 210 non income verifiable borrowers. The completed loan status category has verifiable income borrowers of 35292 while 2792 does not have a verifiable income.
# def a func to plot Loan Original Amount and
# Stated Monthly Income group by a feature
def draw_bi_bar_plot(by, groupby, title):
"""plot a barplot on groupby function"""
"""param: by, groupby, title"""
"""return ax, mean"""
ax = shrink_loan_df.groupby(by= by)[['LoanOriginalAmount', 'StatedMonthlyIncome']].mean().plot(kind = 'bar', figsize = (20,10), fontsize = 15)
mean = shrink_loan_df.groupby(by = groupby).mean()[['LoanOriginalAmount', 'StatedMonthlyIncome']]
plt.xticks(rotation = 360)
plt.title(title, fontsize = 25)
plt.ylabel('Mean Values', fontsize = 20)
plt.xlabel(by, fontsize = 20)
return ax, mean;
# calling the function to group by year
draw_bi_bar_plot('Year', 'Year', 'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Year.')
(<AxesSubplot:title={'center':'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Year.'}, xlabel='Year', ylabel='Mean Values'>,
LoanOriginalAmount StatedMonthlyIncome
Year
2005 3576.681818 11122.727273
2006 4763.325262 4743.353708
2007 7049.545026 4653.942321
2008 6021.628289 4618.605869
2009 4354.858818 5091.406448
2010 4766.540340 5290.192675
2011 6692.021108 5659.263359
2012 7833.842173 5709.352376
2013 10545.065599 6156.148231
2014 11912.219520 6329.746796)
The loan original amount has the highest mean value of the year 2014 with a lowest mean value in the year 2005, for the stated monthly income the year 2005 has the highest mean value with year 2008 with the lowest mean value.
# calling the function to group by loan status
draw_bi_bar_plot('LoanStatus', 'LoanStatus', 'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Loan Status.')
(<AxesSubplot:title={'center':'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Loan Status.'}, xlabel='LoanStatus', ylabel='Mean Values'>,
LoanOriginalAmount StatedMonthlyIncome
LoanStatus
Cancelled 1700.000000 2608.600000
Chargedoff 6398.916694 4485.688793
Completed 6189.093239 5324.206204
Current 10360.835018 6152.938667
Defaulted 6486.798525 4366.463930
FinalPaymentInProgress 8346.121951 6311.419512
Past Due 8258.437349 5366.540397)
The loan original amount has the highest mean value in the current loan status category and a lowest mean value in the cancelled loan category, for the stated monthly income the loan status final payment in progress has the highest mean value with loan status cancelled with the lowest mean values.
# calling the function to group by verifiable income
draw_bi_bar_plot('IncomeVerifiable', 'IncomeVerifiable', 'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Verifiable Income.')
(<AxesSubplot:title={'center':'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Verifiable Income.'}, xlabel='IncomeVerifiable', ylabel='Mean Values'>,
LoanOriginalAmount StatedMonthlyIncome
IncomeVerifiable
False 7111.896413 5022.028838
True 8437.904368 5655.938519)
The loan original amount has the highest mean value in true value of verifiable income variable, and a lowest mean value in the false value of verifiable income variable, for the stated monthly income variable the highest mean value is true value of verifiable income variable with a lowest mean value in the false value of verifiable income variable.
# define a func to plot lineplot
def draw_line_plot(x, y, title):
"""plot a line plot"""
"""param: x, y, title"""
"""return none"""
plt.figure(figsize=(8,6), dpi= 400)
p = sns.lineplot(x = x, y = y, data = shrink_loan_df)
plt.title(title);
# call func to plot lineplot of stated monthly
# income against loan original amount
draw_line_plot('StatedMonthlyIncome', 'LoanOriginalAmount', 'Line Graph Depicting Relatioship Between Stated Monthly Income and Loan Original Amount.')
The above figure illustrates that majority of the values fall between 0 doallrs and 25000 dollars, as this indicates no positive correlation between the two variables.
# call func to lineplot of monthly
# loan payment against loan original amount
draw_line_plot('LoanOriginalAmount', 'MonthlyLoanPayment', 'Line Graph Depicting Relationship Between Monthly Loan Payment and Loan Original Amount.')
The above figure illustrates a positive correlation between the two variables, as the original loan amount increases the monthly loan payment increase relatively.
Summary¶
The verifiable income variable has the true value as the highest occurence in the distribution from earlier findings, the spread over the loan status category on the current loan status can be figure out as 52434 has true (verifiable income), while 4142 were false(no verifiable income). The past due category has a spread of 1857 verifiable income borrowers, and 210 non income verifiable borrowers. The completed loan status category has verifiable income borrowers of 35292 while 2792 does not have a verifiable income. Apparently the medium term has the highest occurence of loan duration in the distribution from earlier findings, the spread over the loan status category on the current loan status can be figure out as 20127 were on long term, 36387 were on medium term while just 62 were on the short term. The past due category has a spread of 684 long term duration, 1373 medium term duration, and 10 short term duration. The home ownership variable has the true and false values evenly distributed from earlier findings, the spread over the loan status category on the current loan status can be figure out as 30478 were true(home owner), while 26098 were false(no home ownership). The past due category has a spread of 1003 home ownership, and 1064 no home ownership. The completed loan status category has home ownership of 18280 while 19794 does not own a home. There exist a positive correlation between the two variables, as the original loan amount increases the monthly loan payment increase relatively.
# define func to plot scatter plot loan original
# amount against current days of delinquency
def draw_scatter(hue, title):
"""plot a scatterplot"""
"""param: hue, title"""
"""return none"""
plt.figure(figsize=(10,8), dpi = 400)
sns.scatterplot(x = 'LoanOriginalAmount', y = 'MonthlyLoanPayment', hue = hue, data = shrink_loan_df)
plt.xlabel('Loan Original Amount(Dollars)')
plt.ylabel('Monthly Loan Payment(Dollars)')
plt.title(title);
# call func to plot scatter plot group by loan term
draw_scatter('Term', 'Original Loan Amount Against Current Days of Delinquency Group by Loan Term.')
From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by term of loan.
# call func to plot scatter plot group by year
draw_scatter('Year', 'Original Loan Amount Against Current Days of Delinquency Group by Loan Year.')
From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by year of loan.
# call func to plot scatter plot group by verifiable income
draw_scatter('IncomeVerifiable', 'Original Loan Amount Against Current Days of Delinquency Group by Verifiable Income.')
From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by verifiable income.
# call func to plot scatter plot group by borrower home ownership
draw_scatter('IsBorrowerHomeowner', 'Original Loan Amount Against Current Days of Delinquency Group by Borrower Home Ownership.')
From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by home ownership.
# call func to plot scatter plot group by employment status
draw_scatter('EmploymentStatus', 'Original Loan Amount Against Current Days of Delinquency Group Employment Status.')
From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by employment status.
# call func to plot scatter plot group by loan status
draw_scatter('LoanStatus', 'Original Loan Amount Against Current Days of Delinquency Group Loan Status.')
From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by loan status.
# plot a heatmap to show correlation
plt.figure(figsize=(10,8), dpi = 400)
colormap = sns.color_palette('Greens')
sns.heatmap(shrink_loan_df.corr(), annot = True, cmap = colormap, center = 0)
plt.title('Correlation Matrix Depicting Relationship Between Variable with Heatmap.')
Text(0.5, 1.0, 'Correlation Matrix Depicting Relationship Between Variable with Heatmap.')
From the figure above we can deduce the correlation by the heatmap, we could vizualize we have a positive correlation between loan original amount and monthly loan payment with a correlation coefficient of 0.93, the stated monthly income and the original loan amount seems to have no correlation between them.
Summary¶
From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across a scatterplot categorized by term of loan, year, verifiable income, home ownership, loan status, and employment status plotted on different scatter plot to better depict the spread. We could also deduce from the findings the correlation of continuous numerical variable by the heatmap, we could see we have a positive correlation between loan original amount and monthly loan payment with a correlation coefficient of 0.93, the stated monthly income and the original loan amount seems to have no correlation between them.
Before carrying out this study a number of transformation was carried out from the gathered dataset, to begin with the data cleaning process the loan origination date column was converted from object datatype to datetime, afterwards the year was extracted from the datetime before setting the data type of the extracted year column as categorical variable, also the loan term values was trasform from the original values 12months, 36months, and 60months to short term, medium term and long term respectively to make for a better behavior as a categorical variable. The loan status has values respresenting past due in a number of categories of days, these values were replaced with a single value named 'past due' regardless of the number of days. The borrower state values were transformed from state abbrevation to full text without leaving out the stated monthy income and monthy loan payment variable out from transformation, these variables were converted from float to integer for consistency with the loan amount data type. The occupation column was transformed from object data type to categorical data type. Owing to the fact that the features are 81 in number the variables of interest for te purpose of this study were pulled together into a new new dataframe, saved as a comma-seperated values to be referenced for exploration and visualization.
For the distribution of term, loans disbursed on the medium term in this case 36months has the highest occurence with a count of 87778 representing about 77 percent of loan term duration, leaving the other 23 percent distributed between the long term (60 months) and short term (12months)loan duration. the loan has distribution over 10 years between 2005 t0 2014, the year 2013 had the highest number of loan disbursement with a occurence of 34345, followed by the year 2012 and 2014 respectively at second and third position, the least loan disbursement occured in the year 2005 with a occurence of 22 loan disbursement. The distibution of the loan original amount is right-skewed, a case of symmetrical distribution. Most of the loan original amount are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 5000 dollars, the isolated bar indicates an outliers in the ranges between 32000 dollars and 35000 dollars. The distribution of monthly loan payment is right-skewed, a case of symmetrical distribution. Most of the monthly loan payment are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 173 dollars. Finally, We could also deduce from the findings the correlation of continuous numerical variable by the heatmap, we could see we have a positive correlation between loan original amount and monthly loan payment with a correlation coefficient of 0.93, the stated monthly income and the original loan amount seems to have no correlation between them.
The income verifiable variable has the true value as the highest occurence in the distribution from earlier findings, the spread over the loan status category on the current loan status can be figure out as 52434 has true (verifiable income), while 4142 were false(no verifiable income). The past due category has a spread of 1857 verifiable income borrowers, and 210 non income verifiable borrowers. The completed loan status category has verifiable income borrowers of 35292 while 2792 does not have a verifiable income. Apparently the medium term has the highest occurence of loan duration in the distribution from earlier findings, the spread over the loan status category on the current loan status can be figure out as 20127 were on long term, 36387 were on medium term while just 62 were on the short term. The past due category has a spread of 684 long term duration, 1373 medium term duration, and 10 short term duration. The home ownership variable has the true and false values evenly distributed from earlier findings, the spread over the loan status category on the current loan status can be figure out as 30478 were true(home owner), while 26098 were false(no home ownership). The past due category has a spread of 1003 home ownership, and 1064 no home ownership. The completed loan status category has home ownership of 18280 while 19794 does not own a home. There exist a positive correlation between the two variables, as the original loan amount increases the monthly loan payment increase relatively. From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across a scatterplot categorized by term of loan, year, verifiable income, home ownership, loan status, and employment status plotted on different scatter plot to better depict the spread. We could also deduce from the findings the correlation of continuous numerical variable by the heatmap, we could see we have a positive correlation between loan original amount and monthly loan payment with a correlation coefficient of 0.93, the stated monthly income and the original loan amount seems to have no correlation between them.